ABSTRACT


Although web crawlers have been around for twenty years 
by now, there is virtually no freely available, open-source 
crawling software that guarantees high throughput, over- 
comes the limits of single-machine tools and at the same 
time scales linearly with the amount of resources available. 
This paper aims at filling this gap. 


Categories and Subject Descriptors 


H.3.4 [Information storage and retrieval]: Systems and 
software—World Wide Web (WWW) 


1. INTRODUCTION 


A web crawler is a system that systematically downloads a
large number of web pages starting from a seed and following
hypertextual links. In this paper we describe the design and
implementation of BUbiNG, our new web crawler built upon
the experience with UbiCrawler [1] and on the last ten years
of research on the topic. BUbiNG's main features are the
following:


• It is pure Java and open source, released under the
GNU GPLv3+ and available at the LAW web site. 

• It is fully distributed: multiple agents perform the
crawl concurrently and handle the necessary coordination
without the need for any central control; given
enough bandwidth, the crawling speed grows linearly
with the number of agents.

• Its design acknowledges that CPUs and OS kernels
have become extremely efficient in handling a large
number of threads, and that large amounts of RAM
are by now easily available at a moderate cost.

• It is very fast: on a 64-core, 64 GB workstation it can
download hundreds of millions of pages at more than
9000 pages per second while respecting politeness, analyzing,
compressing and storing more than 140 MB/s of
data.

• It guarantees that politeness intervals are satisfied both
at the host and at the IP level, that is, that two data 
requests to the same host or IP are separated by at 
least a specified amount of time. The two intervals can 
be set independently, and, in principle, customized per 
host or IP. 

• It guarantees that the visit of each host is breadth first,
and that the global behavior is also as close as possible
to a breadth-first visit, taking politeness limits
into account; moreover, the global policy can be easily
customized.


For more details about previous work and the main issues
in the design of crawlers, we refer the reader to [5].


2. DESIGN HIGHLIGHTS 


BUbiNG stands on a few architectural choices, some of which
run contrary to common folklore. We made
our decisions after carefully benchmarking several options
and gathering the hands-on experience of similar projects.


• The fetching logic of BUbiNG is built around thousands
of identical fetching threads performing essentially
only synchronous (blocking) I/O. Experience with
recent Linux kernels and the increase in the number of
cores per machine show that this approach consistently
outperforms asynchronous I/O.

• Lock-free [3] data structures are used to “sandwich”
fetching threads, so that they never have to access 
lock-based data structures. This approach is partic- 
ularly useful to avoid direct access to synchronized 
data structures with logarithmic modification time, as 
contention between fetching threads can become very 
significant. Such structures are accessed by a single 
thread that enqueues the result of the slow-access op- 
eration to a lock-free queue, where any fetching thread 
can pick it up quickly. 

• URL storage (both in memory and on disk) is entirely
performed using byte arrays. While this approach
might seem anachronistic, it pays off in terms of footprint
(a String instance can occupy three times the
memory of the corresponding byte array) and in terms
of the number of created objects.

Table 1: Comparison between BUbiNG and the main existing open-source crawlers. Resources are HTML
pages for ClueWeb09 and IRLbot, but include other data types (e.g., images) for ClueWeb12. For reference,
we also report the throughput of IRLbot [2], although the latter is not publicly available.


• Following UbiCrawler’s design [1], BUbiNG agents are
identical and autonomous. The assignment of URLs to 
agents is entirely customizable, but by default we use 
consistent hashing as a fault-tolerant, self-configuring 
assignment function. 
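
The default assignment mentioned in the last item can be illustrated with a minimal sketch of consistent hashing over a ring of virtual points. This is not BUbiNG's actual code: the class name AgentRing, the hash function, the number of virtual points per agent, and the choice of hashing host names (rather than full URLs) are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a consistent-hashing assignment of hosts to agents. */
final class AgentRing {
    private static final int REPLICAS = 100; // virtual points per agent (assumed value)
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** 64-bit FNV-1a followed by a mixing step; any well-distributed hash would do. */
    static long hash(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        return h;
    }

    /** Places REPLICAS virtual points for the agent on the ring. */
    void addAgent(String agent) {
        for (int i = 0; i < REPLICAS; i++) ring.put(hash(agent + "#" + i), agent);
    }

    /** Removes every virtual point of the agent (e.g., after a crash). */
    void removeAgent(String agent) {
        ring.values().removeIf(agent::equals);
    }

    /** Returns the agent owning the first ring point at or after the host's hash. */
    String agentFor(String host) {
        if (ring.isEmpty()) throw new IllegalStateException("no agents");
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(host));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        AgentRing ring = new AgentRing();
        ring.addAgent("agent-0");
        ring.addAgent("agent-1");
        System.out.println(ring.agentFor("www.example.com"));
        ring.removeAgent("agent-1");
        System.out.println(ring.agentFor("www.example.com"));
    }
}
```

The property that makes such a function fault-tolerant and self-configuring is that removing an agent reassigns only the hosts that agent owned; every other host keeps its owner, so the surviving agents need no coordination to agree on the new assignment.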


We now provide a few highlights on data structures that are 
novel and central in the design of BUbiNG. 


Workbench. It is an in-memory data structure that con- 
tains the next URLs to be visited. It is one of the main 
novel ideas in BUbiNG’s design. 

URLs associated with a specific host are kept in a struc- 
ture called visit state, containing a FIFO queue of the next 
URLs to be crawled for that host along with a next-fetch 
field that specifies the first instant in time when a URL from 
the queue can be downloaded, according to the host polite- 
ness configuration. Visit states are further gathered by IP 
address in workbench entries: a workbench entry contains 
a queue of visit states sharing a common IP, prioritized by 
their next-fetch field, and an IP-specific next-fetch, con- 
taining the first instant in time when the IP address can 
be accessed again, according to the IP politeness configura- 
tion. The workbench is the queue of all workbench entries, 
prioritized on the next-fetch field of each entry maximized 
with the next-fetch field on the top element of its queue 
of visit states. In other words, the workbench is a priority
queue of priority queues of FIFO queues. Due to our
choice of priorities, there is a host that can be visited without
violating host or IP politeness if and only if the first
URL of the top visit state of the top workbench entry can be
visited. This approach improves significantly over IRLbot’s
two-queues technique [2], as it can detect in constant time
the next URL to process.
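
The nesting of queues described above can be sketched in a few lines of Java. The sketch is single-threaded and purely illustrative: all class and method names are hypothetical, each host is assumed to resolve to one fixed IP, and a fetch is treated as completing instantly, whereas in a real crawler a visit state would be handed to a fetching thread and put back after the download.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Illustrative sketch only; names are hypothetical, not BUbiNG's API. */
final class WorkbenchSketch {
    /** Per-host FIFO queue plus the earliest time the host may be contacted again. */
    static final class VisitState {
        final ArrayDeque<String> urls = new ArrayDeque<>();
        long nextFetch;
    }

    /** Per-IP queue of visit states, prioritized by their next-fetch field. */
    static final class Entry {
        final PriorityQueue<VisitState> states =
            new PriorityQueue<>(Comparator.comparingLong((VisitState v) -> v.nextFetch));
        long ipNextFetch;

        /** IP next-fetch maximized with the next-fetch of the top visit state. */
        long priority() {
            VisitState top = states.peek();
            return top == null ? Long.MAX_VALUE : Math.max(ipNextFetch, top.nextFetch);
        }
    }

    private final PriorityQueue<Entry> entries =
        new PriorityQueue<>(Comparator.comparingLong(Entry::priority));
    private final Map<String, Entry> byIp = new HashMap<>();
    private final Map<String, VisitState> byHost = new HashMap<>();
    private final long hostDelay, ipDelay;

    WorkbenchSketch(long hostDelay, long ipDelay) {
        this.hostDelay = hostDelay;
        this.ipDelay = ipDelay;
    }

    /** Enqueues a URL for a host; the host is assumed to resolve to a single, fixed IP. */
    void add(String ip, String host, String url) {
        Entry e = byIp.computeIfAbsent(ip, k -> new Entry());
        VisitState v = byHost.computeIfAbsent(host, k -> new VisitState());
        if (v.urls.isEmpty()) {
            // The entry's priority may change: take it out before mutating it.
            // (Linear-time removal; a real implementation would avoid this.)
            entries.remove(e);
            v.urls.add(url);
            e.states.add(v);
            entries.add(e);
        } else {
            v.urls.add(url); // v is already queued; priorities are unaffected
        }
    }

    /** Returns the next URL fetchable at time now, or null if politeness forbids any fetch. */
    String next(long now) {
        Entry e = entries.peek(); // constant-time look at the best candidate
        if (e == null || e.priority() > now) return null;
        entries.poll();
        VisitState v = e.states.poll();
        String url = v.urls.poll();
        // Simplification: the fetch is treated as completing instantly at time now.
        v.nextFetch = now + hostDelay;
        e.ipNextFetch = now + ipDelay;
        if (!v.urls.isEmpty()) e.states.add(v);
        if (!e.states.isEmpty()) entries.add(e);
        return url;
    }

    public static void main(String[] args) {
        WorkbenchSketch w = new WorkbenchSketch(4000, 500); // delays in milliseconds
        w.add("10.0.0.1", "a.example", "http://a.example/1");
        w.add("10.0.0.1", "a.example", "http://a.example/2");
        System.out.println(w.next(0));
        System.out.println(w.next(500));
        System.out.println(w.next(4000));
    }
}
```

Deciding whether anything can be fetched requires only a look at the top of the outer queue; heap updates are needed only when a URL is actually handed out. Priorities are mutated only while an element is outside its queue, which keeps the heaps of java.util.PriorityQueue consistent.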


Cache and sieve. To keep track of already-seen URLs,
every time a URL is discovered it is checked against a
high-performance approximate LRU cache containing 128-bit
fingerprints: more than 90% of the URLs discovered are
discarded at this stage. The cache also serves another important
purpose: it prevents frequently found URLs assigned to
another agent from being retransmitted several times. URLs that pass
the cache check are enqueued to a sieve, a data structure
originally used by Mercator [4] that stores fingerprints of
the set of seen URLs on disk and merges them periodically
with a set accumulated in RAM, emitting new URLs that
must be crawled. We tested an alternative sieve described
in [2], DRUM (an extension of the Mercator sieve), but
DRUM destroys the breadth-first order of the visit, and we
found no performance advantages with respect to a standard
Mercator sieve coupled with our cache.
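
The interplay between cache and sieve can be sketched as follows, under several loudly stated assumptions: an exact LRU built on LinkedHashMap stands in for the approximate LRU cache, MD5 is used merely as a convenient 128-bit fingerprint, an in-memory HashSet stands in for the on-disk fingerprint set, and the merge is triggered by hand. The class name SeenUrlFilter and its methods are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; not BUbiNG's cache or sieve implementation. */
final class SeenUrlFilter {
    private final Map<BigInteger, Boolean> cache;            // LRU cache of fingerprints
    private final Set<BigInteger> onDisk = new HashSet<>();  // stand-in for the on-disk set
    private final List<String> pending = new ArrayList<>();  // accumulated in RAM, in discovery order

    SeenUrlFilter(final int cacheSize) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        cache = new LinkedHashMap<BigInteger, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<BigInteger, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** 128-bit fingerprint of a URL (MD5 is an assumption made for the sketch). */
    static BigInteger fingerprint(String url) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(url.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, d);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns false if the cache already holds the URL; otherwise enqueues it to the sieve. */
    boolean discover(String url) {
        if (cache.put(fingerprint(url), Boolean.TRUE) != null) return false;
        pending.add(url);
        return true;
    }

    /** Merges the pending URLs with the seen set, emitting new URLs in discovery order. */
    List<String> flush() {
        List<String> fresh = new ArrayList<>();
        for (String url : pending) {
            if (onDisk.add(fingerprint(url))) fresh.add(url);
        }
        pending.clear();
        return fresh;
    }

    public static void main(String[] args) {
        SeenUrlFilter f = new SeenUrlFilter(1000);
        f.discover("http://example.com/a");
        f.discover("http://example.com/a"); // stopped by the cache
        f.discover("http://example.com/b");
        System.out.println(f.flush());
    }
}
```

The cache is only a filter: a URL evicted from it may reach the sieve a second time, but the sieve remains authoritative and will not emit it again. Emitting in discovery order is what preserves the breadth-first nature of the visit.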


Distributor. It is a high-priority thread that processes
URLs that have been emitted by the sieve. The main task
of the distributor is to iteratively dequeue a URL from the
sieve, check whether it belongs to a host for which a visit
state already exists, and then either create a new visit
state or enqueue the URL to an existing one. If a new
visit state is necessary, it is passed to a set of DNS threads
that perform DNS resolution and then move the visit state
onto the workbench. Since, however, breadth-first visit queues
grow exponentially, and the workbench can use only a fixed
amount of in-core memory, it is necessary to virtualize the
workbench, that is, to write to disk part of the URLs coming
out of the sieve. In the first versions of BUbiNG, we tried
designs inspired by the BEAST module of IRLbot [2], which
however is only vaguely specified; moreover, BEAST-based
implementations performed poorly unless we discarded all
sites generating errors. Currently, BUbiNG uses a sophisticated
memory-mapped system that can handle millions of
on-disk FIFO queues by appending elements in a log-like
fashion and periodically collecting unused space. Alternatively,
if page-level prioritization is necessary, BUbiNG can
virtualize the workbench using Berkeley DB.


3. EXPERIMENTS 


We ran two kinds of experiments. One batch was performed
in vitro with an HTTP proxy simulating network
connections towards the web and generating fake HTML
pages. Four agents (with IP delay 500 ms and host delay 4 s)
downloaded on average 36600 pages per second. A second
batch of experiments was run at iStella, an Italian commercial
search engine that kindly provided us with a 48-core,
512 GB RAM machine with a 2 Gb/s link that we were able
to fully saturate, downloading 5400 pages per second using
a single agent. Table 1 reports a comparison with public data
about other crawlers.